Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation.
نویسندگان
چکیده
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility of incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
منابع مشابه
Clustering of Short Read Sequences for de novo Transcriptome Assembly
Given the importance of transcriptome analysis in various biological studies and considering thevast amount of whole transcriptome sequencing data, it seems necessary to develop analgorithm to assemble transcriptome data. In this study we propose an algorithm fortranscriptome assembly in the absence of a reference genome. First, the contiguous sequencesare generated using de Bruijn graph with d...
متن کاملMsiq: Joint Modeling of Multiple Rna-seq Samples for Accurate Isoform Quantification by Wei
Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a highthroughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challengi...
متن کاملMSIQ: Joint Modeling of Multiple RNA-seq Samples for Accurate Isoform Quantification
Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a highthroughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structure, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate quantification of RNA isoforms from RNA-seq data is a ...
متن کاملStatistical Methods for Overdispersion in mRNA-Seq Count Data
Recent developments in Next-Generation Sequencing (NGS) technologies have opened doors for ultra high throughput sequencing mRNA (mRNA-seq) of the whole transcriptome. mRNA-seq has enabled researchers to comprehensively search for underlying biological determinants of diseases and ultimately discover novel preventive and therapeutic solutions. Unfortunately, given the complexity of mRNA-seq dat...
متن کاملBellerophontes: A RNA-Seq data analysis framework for chimeric transcripts discovery based on accurate fusion model
Motivation: Next generation sequencing technology allows the detection of genomic structural variations, novel genes and transcript isoforms from the analysis of high throughput data. In this work, we propose a new framework for the detection of fusion transcripts through short paired-end reads which integrates splicing-driven alignment and abundance estimation analysis, producing a more accura...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Proceedings of the National Academy of Sciences of the United States of America
دوره 108 50 شماره
صفحات -
تاریخ انتشار 2011